Day 1 AM: Introduction to R, RStudio and the data.frame

Using R and RStudio

R is a flexible language specialized for data analysis and visualization. This workshop focuses on tabular data that can be loaded into an R data.frame for exploratory analysis and visualization. Other aspects of R, such as general-purpose programming, modeling for statistical inference, and the use of Bioconductor for specialized assay analysis, are de-emphasized in this workshop.

Most people using R use it in the context of the RStudio graphical user interface (GUI) environment, and we introduce this environment to illustrate:

  • The anatomy of RStudio
  • The R console
  • Writing, executing and “sourcing” R scripts
  • Using R markdown and notebooks for literate programming
  • Getting help
RStudio screenshot
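
As a minimal sketch of two of the topics above ("sourcing" a script and getting help), the following can be typed into the R console. The script contents and the variable names (`script_path`, `x_mean`, `matches`) are made up for illustration; a temporary file stands in for a script you would normally write in the RStudio editor.

```r
# Write a tiny two-line script to a temporary file (hypothetical contents)
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10",
             "x_mean <- mean(x)"),
           script_path)

# "Sourcing" runs every line of the script, as if typed at the console
source(script_path)
x_mean

# Getting help: these open the help page in an interactive session
# ?read.csv
# help("read.csv")

# apropos() lists objects whose names match a pattern,
# which is handy when you only remember part of a function name
matches <- apropos("read.csv")
matches
```

In RStudio, the "Source" button in the editor pane does the same thing as calling `source()` on the open file.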


Overview of the exploratory data analysis pipeline

The exploratory data analysis pipeline typically consists of the following steps:

  • Converting messy data into tidy data
  • Manipulating tidy data
  • Visualizing tidy data

These actions are generally performed using the tidyverse meta-package. In this workshop we cover the tidyverse and these stages in reverse order, since the first two stages are rather dry without first establishing the motivation. First, however, we cover some essential concepts and show how data is loaded in the first place.
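
To make the three stages concrete, here is a small sketch on a made-up table. The data frame `messy`, its column names, and the values are hypothetical and not part of the workshop data. It also assumes tidyr 1.0.0 or later for `pivot_longer()`; with older versions of tidyr, `gather()` plays the same role.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Hypothetical "messy" table: one row per gene, one column per sample
messy <- data.frame(
  gene = c("g1", "g2", "g3"),
  s1   = c(1.2, 3.4, 5.6),
  s2   = c(2.1, 4.3, 6.5)
)

# 1. Messy -> tidy: one row per (gene, sample) observation
tidy <- pivot_longer(messy, cols = c(s1, s2),
                     names_to = "sample", values_to = "expression")

# 2. Manipulate tidy data: mean expression per gene
gene_means <- tidy %>%
  group_by(gene) %>%
  summarise(mean_expression = mean(expression))

# 3. Visualize tidy data: build the plot object; print(p) draws it
p <- ggplot(tidy, aes(x = gene, y = expression, colour = sample)) +
  geom_point()
```

Each step takes a data frame in and hands a data frame (or a plot built from one) out, which is why the tidy format matters for the later stages.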

In [80]:
library(tidyverse)
Warning message:
“Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11).
Please reinstall dplyr to avoid random crashes or undefined behavior.”
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Warning message:
“package ‘dplyr’ was built under R version 3.4.1”
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Types, collections and variable assignments

Strings

In [1]:
"This is a string"
'This is a string'
In [2]:
substr("This is a string", 6, 10)
'is a '
In [3]:
paste("gene", 1:10)
  1. 'gene 1'
  2. 'gene 2'
  3. 'gene 3'
  4. 'gene 4'
  5. 'gene 5'
  6. 'gene 6'
  7. 'gene 7'
  8. 'gene 8'
  9. 'gene 9'
  10. 'gene 10'
In [4]:
paste("Hello", "world", sep=", ")
'Hello, world'

Numbers

In [5]:
42
42
In [6]:
3.14
3.14
In [7]:
0.5 + 0.5i
0.5+0.5i

Boolean values

In [8]:
TRUE
TRUE
In [9]:
2 > 3
FALSE

Factors

In [10]:
sex <- as.factor(c("M", "F"))
In [11]:
sex
  1. M
  2. F
In [12]:
str(sex)
 Factor w/ 2 levels "F","M": 2 1

Missing values

In [13]:
NA
<NA>
In [14]:
4 * NA
<NA>

Vectors

In [15]:
5:10
  1. 5
  2. 6
  3. 7
  4. 8
  5. 9
  6. 10
In [16]:
10:5
  1. 10
  2. 9
  3. 8
  4. 7
  5. 6
  6. 5
In [17]:
c(1,1,2,3,5,8)
  1. 1
  2. 1
  3. 2
  4. 3
  5. 5
  6. 8
In [18]:
seq(1, 10, by=3)
  1. 1
  2. 4
  3. 7
  4. 10
In [19]:
rep(1:4, 2)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 1
  6. 2
  7. 3
  8. 4
In [20]:
rep(1:4, each=2)
  1. 1
  2. 1
  3. 2
  4. 2
  5. 3
  6. 3
  7. 4
  8. 4
In [21]:
rnorm(5, 100, 15)
  1. 102.594622335355
  2. 111.635958574435
  3. 103.443624575272
  4. 106.529771804924
  5. 85.6990987716058
In [22]:
sample(c("H", "T"), 5, replace = TRUE)
  1. 'T'
  2. 'H'
  3. 'H'
  4. 'T'
  5. 'H'

Matrices

In [23]:
matrix(1:12, nrow=4)
1 5 9
2 6 10
3 7 11
4 8 12
In [24]:
matrix(1:12, nrow=4, byrow=TRUE)
1 2 3
4 5 6
7 8 9
10 11 12

Lists

In [25]:
list(a=1, b=2)
$a
1
$b
2
In [26]:
list(a=5:10, b= 10:5)
$a
  1. 5
  2. 6
  3. 7
  4. 8
  5. 9
  6. 10
$b
  1. 10
  2. 9
  3. 8
  4. 7
  5. 6
  6. 5

Assignment

In [27]:
greet <- "hello"
In [28]:
greet
'hello'
In [29]:
my.vec <- 5:10
In [30]:
my.vec
  1. 5
  2. 6
  3. 7
  4. 8
  5. 9
  6. 10
In [31]:
my.list <- list(a=5:10, b= 10:5)
In [32]:
my.list
$a
  1. 5
  2. 6
  3. 7
  4. 8
  5. 9
  6. 10
$b
  1. 10
  2. 9
  3. 8
  4. 7
  5. 6
  6. 5
In [33]:
my.matrix <- matrix(1:12, nrow=4, byrow=TRUE)
In [34]:
my.matrix
1 2 3
4 5 6
7 8 9
10 11 12

Indexing

Vectors

In [35]:
my.vec
  1. 5
  2. 6
  3. 7
  4. 8
  5. 9
  6. 10
In [36]:
my.vec[1]
5
In [37]:
my.vec[-1]
  1. 6
  2. 7
  3. 8
  4. 9
  5. 10
In [38]:
my.vec[-c(1,3)]
  1. 6
  2. 8
  3. 9
  4. 10
In [39]:
my.vec[2:4]
  1. 6
  2. 7
  3. 8

Lists

In [40]:
my.list
$a
  1. 5
  2. 6
  3. 7
  4. 8
  5. 9
  6. 10
$b
  1. 10
  2. 9
  3. 8
  4. 7
  5. 6
  6. 5
In [41]:
my.list$a
  1. 5
  2. 6
  3. 7
  4. 8
  5. 9
  6. 10
In [42]:
my.list[1]
$a =
  1. 5
  2. 6
  3. 7
  4. 8
  5. 9
  6. 10
In [43]:
my.list[[1]]
  1. 5
  2. 6
  3. 7
  4. 8
  5. 9
  6. 10

Matrices

In [44]:
my.matrix
1 2 3
4 5 6
7 8 9
10 11 12
In [45]:
my.matrix[2,3]
6
In [46]:
my.matrix[2,]
  1. 4
  2. 5
  3. 6
In [47]:
my.matrix[,3]
  1. 3
  2. 6
  3. 9
  4. 12
In [48]:
my.matrix[2:3, 2:3]
5 6
8 9

Getting data into a data.frame

Preloaded data.frame

R preloads several data sets that are often used as examples in R tutorials. To find out what these are, enter

library(help="datasets")
In [58]:
head(iris)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
4.6           3.1          1.5           0.2          setosa
5.0           3.6          1.4           0.2          setosa
5.4           3.9          1.7           0.4          setosa
In [59]:
head(faithful)
eruptions  waiting
3.600      79
1.800      54
3.333      74
2.283      62
4.533      85
2.883      55

Creating a data.frame from scratch

A data frame is essentially a list of vectors of the same length, where each vector holds values of a single type and is treated as a column.

In [56]:
n <- 8
my.df <- data.frame(pid=1:n,
                    sex=as.factor(sample(c("M", "F"), n, replace = TRUE)),
                    iq=round(rnorm(n, 100, 15), 0))
In [57]:
my.df
pid  sex  iq
1    M    110
2    F    104
3    F    65
4    M    106
5    F    89
6    F    95
7    M    129
8    M    96
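
The "list of equal-length columns" view can be checked directly. The following is a small sketch on a made-up three-row data frame; the names `df` and `bad` and the values are illustrative only.

```r
# Hypothetical three-row data frame used only for illustration
df <- data.frame(pid = 1:3,
                 sex = factor(c("M", "F", "F")),
                 iq  = c(110, 104, 95))

is.list(df)          # a data.frame is a list under the hood
length(df)           # list length = number of columns
sapply(df, class)    # each column holds a single type
sapply(df, length)   # every column has the same length

# Columns whose lengths cannot be reconciled are rejected with an error;
# tryCatch() captures the error message instead of stopping the script
bad <- tryCatch(data.frame(a = 1:3, b = 1:2),
                error = function(e) conditionMessage(e))
bad
```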

Loading from CSV or other tabular file

In [69]:
url <- "http://vincentarelbundock.github.io/Rdatasets/csv/datasets/Titanic.csv"
titanic <- read.csv(url)
In [70]:
head(titanic)
X  Name                                           PClass  Age    Sex     Survived  SexCode
1  Allen, Miss Elisabeth Walton                   1st     29.00  female  1         1
2  Allison, Miss Helen Loraine                    1st     2.00   female  0         1
3  Allison, Mr Hudson Joshua Creighton            1st     30.00  male    0         0
4  Allison, Mrs Hudson JC (Bessie Waldo Daniels)  1st     25.00  female  0         1
5  Allison, Master Hudson Trevor                  1st     0.92   male    1         0
6  Anderson, Mr Harry                             1st     47.00  male    1         0

We can also download the file and read it in as a local file.

In [71]:
download.file(url = url, destfile="titanic.csv")
In [72]:
titanic.1 <- read.csv("titanic.csv")
In [73]:
head(titanic.1)
X  Name                                           PClass  Age    Sex     Survived  SexCode
1  Allen, Miss Elisabeth Walton                   1st     29.00  female  1         1
2  Allison, Miss Helen Loraine                    1st     2.00   female  0         1
3  Allison, Mr Hudson Joshua Creighton            1st     30.00  male    0         0
4  Allison, Mrs Hudson JC (Bessie Waldo Daniels)  1st     25.00  female  0         1
5  Allison, Master Hudson Trevor                  1st     0.92   male    1         0
6  Anderson, Mr Harry                             1st     47.00  male    1         0

Understanding the data.frame

Size

In [74]:
dim(titanic)
  1. 1313
  2. 7

Structure

In [75]:
str(titanic)
'data.frame':   1313 obs. of  7 variables:
 $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Name    : Factor w/ 1310 levels "Abbing, Mr Anthony",..: 22 25 26 27 24 31 45 46 50 54 ...
 $ PClass  : Factor w/ 4 levels "*","1st","2nd",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Age     : num  29 2 30 25 0.92 47 63 39 58 71 ...
 $ Sex     : Factor w/ 2 levels "female","male": 1 1 2 1 2 2 1 2 1 2 ...
 $ Survived: int  1 0 0 0 1 1 1 0 1 0 ...
 $ SexCode : int  1 1 0 1 0 0 1 0 1 0 ...

Top rows

In [77]:
head(titanic, n=4)
X  Name                                           PClass  Age  Sex     Survived  SexCode
1  Allen, Miss Elisabeth Walton                   1st     29   female  1         1
2  Allison, Miss Helen Loraine                    1st     2    female  0         1
3  Allison, Mr Hudson Joshua Creighton            1st     30   male    0         0
4  Allison, Mrs Hudson JC (Bessie Waldo Daniels)  1st     25   female  0         1

Bottom rows

In [78]:
tail(titanic, n=2)
      X     Name              PClass  Age  Sex   Survived  SexCode
1312  1312  Lievens, Mr Rene  3rd     24   male  0         0
1313  1313  Zimmerman, Leo    3rd     29   male  0         0

Random rows

In [81]:
sample_n(titanic, 4)
      X     Name                            PClass  Age  Sex     Survived  SexCode
1019  1019  Miles, Mr Frank                 3rd     NA   male    0         0
243   243   Spedden, Master Robert Douglas  1st     6    male    1         0
934   934   Kink, Miss Louise Gretchen      3rd     4    female  1         1
1219  1219  Smiljanovic, Mr Mile            3rd     NA   male    0         0

Indexing

Because a data.frame is fundamentally a list of columns that is laid out like a matrix, we can index it using either list or matrix notation.

In [82]:
titanic$Name[1:4]
  1. Allen, Miss Elisabeth Walton
  2. Allison, Miss Helen Loraine
  3. Allison, Mr Hudson Joshua Creighton
  4. Allison, Mrs Hudson JC (Bessie Waldo Daniels)
In [85]:
titanic[1:5, 3]
  1. 1st
  2. 1st
  3. 1st
  4. 1st
  5. 1st
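
The two notations are interchangeable for pulling out a column. As a small sketch, the following uses the first six rows of the preloaded `iris` data set rather than `titanic`, so that it runs without a download; the variable names are illustrative.

```r
df <- head(iris)   # first six rows of a preloaded data set

# List notation: pick a column by name, then index the resulting vector
by_name     <- df$Species[1:3]
by_brackets <- df[["Species"]][1:3]

# Matrix notation: [rows, columns]; Species is the 5th column
by_position <- df[1:3, 5]

# All three give the same factor vector
identical(by_name, by_position)
identical(by_name, by_brackets)

# drop = FALSE keeps the result as a one-column data.frame
kept <- df[1:3, "Species", drop = FALSE]

# A logical vector in the row position filters rows
long_sepals <- df[df$Sepal.Length > 5, ]
```

Note that selecting a single column with matrix notation returns a plain vector unless `drop = FALSE` is given.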

Exporting a data.frame

In [88]:
write.csv(titanic, "my_titanic.csv", row.names = FALSE)
In [95]:
list.files(".", pattern = "\\.csv$")
  1. 'my_titanic.csv'
  2. 'titanic.csv'
In [96]:
titanic.2 <- read.csv("my_titanic.csv")
In [98]:
head(titanic.2, n=3)
X  Name                                 PClass  Age  Sex     Survived  SexCode
1  Allen, Miss Elisabeth Walton         1st     29   female  1         1
2  Allison, Miss Helen Loraine          1st     2    female  0         1
3  Allison, Mr Hudson Joshua Creighton  1st     30   male    0         0

Installing packages from CRAN and Bioconductor

Install from CRAN

The simplest way is to use the Tools > Install Packages... menu item in RStudio, but you can also install packages from the console.

In [100]:
install.packages("pwr", repos="http://cran.us.r-project.org")

The downloaded binary packages are in
        /var/folders/3l/tbmzdkss71152d8t9n1f8nx40000gn/T//Rtmpmv86yS/downloaded_packages
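
In scripts it is often convenient to install a package only when it is missing. The following is a minimal sketch; `install_if_missing` is a hypothetical helper, not a built-in function, and the default mirror URL is an assumption that you can replace with any CRAN mirror.

```r
# Hypothetical helper: install a CRAN package only if it is not yet available.
# Returns (invisibly) TRUE if an install was attempted, FALSE otherwise.
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))   # already installed, nothing to do
  }
  install.packages(pkg, repos = repos)
  invisible(TRUE)
}

# "stats" ships with R, so this call never triggers a download
installed_now <- install_if_missing("stats")
installed_now
```

Calling `install_if_missing("pwr")` would then download `pwr` only on machines where it is not already present.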

Install from Bioconductor

In [101]:
source("https://bioconductor.org/biocLite.R")
biocLite("ggbio")
Bioconductor version 3.5 (BiocInstaller 1.26.0), ?biocLite for help
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.5 (BiocInstaller 1.26.0), R 3.4.0 (2017-04-21).
Installing package(s) ‘ggbio’

The downloaded binary packages are in
        /var/folders/3l/tbmzdkss71152d8t9n1f8nx40000gn/T//Rtmpmv86yS/downloaded_packages
Old packages: 'agricolae', 'AnnotationDbi', 'Biostrings', 'boot', 'bsseq',
  'ChAMP', 'cowplot', 'curl', 'devtools', 'dplyr', 'FSA', 'GGally', 'git2r',
  'igraph', 'limma', 'mgcv', 'modelr', 'plotly', 'purrr', 'sandwich',
  'stringdist', 'VGAM', 'withr'
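
The `biocLite()` script shown above applies to the Bioconductor 3.5 release used when this notebook was run. More recent Bioconductor releases install packages through the `BiocManager` package from CRAN instead. A minimal sketch for newer R installations follows; `install_bioc` is a hypothetical wrapper name and the mirror URL is an assumption.

```r
# Hypothetical wrapper around the BiocManager workflow
install_bioc <- function(pkgs) {
  # BiocManager itself is an ordinary CRAN package
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager", repos = "https://cloud.r-project.org")
  }
  BiocManager::install(pkgs)
}

# Not run here because it downloads packages:
# install_bioc("ggbio")
```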